This project explores the Video Games Sales with Ratings dataset from Kaggle.
It combines a web scrape of VGChartz video game sales data with a web scrape of Metacritic, which provides the game ratings. Some observations are missing because Metacritic covers only a subset of the platforms. There are approximately 6,900 complete cases.
The dataset has 16719 observations and 16 variables:
## [1] 16719 16
Let’s summarize the data to see if there are any missing values.
## Name Platform Year_of_Release
## Need for Speed: Most Wanted: 12 PS2 :2161 2008 :1427
## FIFA 14 : 9 DS :2152 2009 :1426
## LEGO Marvel Super Heroes : 9 PS3 :1331 2010 :1255
## Madden NFL 07 : 9 Wii :1320 2007 :1197
## Ratatouille : 9 X360 :1262 2011 :1136
## Angry Birds Star Wars : 8 PSP :1209 2006 :1006
## (Other) :16663 (Other):7284 (Other):9272
## Genre Publisher
## Action :3370 Electronic Arts : 1356
## Sports :2348 Activision : 985
## Misc :1750 Namco Bandai Games : 939
## Role-Playing:1500 Ubisoft : 933
## Shooter :1323 Konami Digital Entertainment: 834
## Adventure :1303 THQ : 715
## (Other) :5125 (Other) :10957
## NA_Sales EU_Sales JP_Sales Other_Sales
## Min. : 0.0000 Min. : 0.000 Min. : 0.0000 Min. : 0.00000
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.00000
## Median : 0.0800 Median : 0.020 Median : 0.0000 Median : 0.01000
## Mean : 0.2633 Mean : 0.145 Mean : 0.0776 Mean : 0.04733
## 3rd Qu.: 0.2400 3rd Qu.: 0.110 3rd Qu.: 0.0400 3rd Qu.: 0.03000
## Max. :41.3600 Max. :28.960 Max. :10.2200 Max. :10.57000
##
## Global_Sales Critic_Score Critic_Count User_Score
## Min. : 0.0100 Min. :13.00 Min. : 3.00 :6704
## 1st Qu.: 0.0600 1st Qu.:60.00 1st Qu.: 12.00 tbd :2425
## Median : 0.1700 Median :71.00 Median : 21.00 7.8 : 324
## Mean : 0.5335 Mean :68.97 Mean : 26.36 8 : 290
## 3rd Qu.: 0.4700 3rd Qu.:79.00 3rd Qu.: 36.00 8.2 : 282
## Max. :82.5300 Max. :98.00 Max. :113.00 8.3 : 254
## NA's :8582 NA's :8582 (Other):6440
## User_Count Developer Rating
## Min. : 4.0 :6623 :6769
## 1st Qu.: 10.0 Ubisoft : 204 E :3991
## Median : 24.0 EA Sports: 172 T :2961
## Mean : 162.2 EA Canada: 167 M :1563
## 3rd Qu.: 81.0 Konami : 162 E10+ :1420
## Max. :10665.0 Capcom : 139 EC : 8
## NA's :9129 (Other) :9252 (Other): 7
There are some missing values in Critic_Score, Critic_Count, User_Score and User_Count. Let’s remove the rows with the missing values.
## Name Platform
## Madden NFL 07 : 9 PS2 :1161
## LEGO Star Wars II: The Original Trilogy : 8 X360 : 881
## Need for Speed: Most Wanted : 8 PS3 : 790
## Harry Potter and the Order of the Phoenix : 7 PC : 703
## LEGO Batman: The Videogame : 7 XB : 581
## LEGO Indiana Jones: The Original Adventures: 7 Wii : 492
## (Other) :6971 (Other):2409
## Year_of_Release Genre Publisher
## 2008 : 595 Action :1677 Electronic Arts : 957
## 2007 : 590 Sports : 973 Ubisoft : 500
## 2005 : 562 Shooter : 886 Activision : 498
## 2009 : 554 Role-Playing: 721 Sony Computer Entertainment: 316
## 2006 : 528 Racing : 598 THQ : 309
## 2003 : 499 Platform : 407 Nintendo : 294
## (Other):3689 (Other) :1755 (Other) :4143
## NA_Sales EU_Sales JP_Sales Other_Sales
## Min. : 0.0000 Min. : 0.0000 Min. :0.00000 Min. : 0.00000
## 1st Qu.: 0.0600 1st Qu.: 0.0200 1st Qu.:0.00000 1st Qu.: 0.01000
## Median : 0.1500 Median : 0.0600 Median :0.00000 Median : 0.02000
## Mean : 0.3893 Mean : 0.2331 Mean :0.06295 Mean : 0.08153
## 3rd Qu.: 0.3900 3rd Qu.: 0.2100 3rd Qu.:0.01000 3rd Qu.: 0.07000
## Max. :41.3600 Max. :28.9600 Max. :6.50000 Max. :10.57000
##
## Global_Sales Critic_Score Critic_Count User_Score
## Min. : 0.0100 Min. :13.00 Min. : 3.00 7.8 : 298
## 1st Qu.: 0.1100 1st Qu.:62.00 1st Qu.: 14.00 8 : 267
## Median : 0.2900 Median :72.00 Median : 24.00 8.2 : 267
## Mean : 0.7671 Mean :70.25 Mean : 28.78 8.5 : 245
## 3rd Qu.: 0.7500 3rd Qu.:80.00 3rd Qu.: 39.00 7.5 : 240
## Max. :82.5300 Max. :98.00 Max. :113.00 7.9 : 240
## (Other):5460
## User_Count Developer Rating
## Min. : 4.0 EA Canada : 152 T :2420
## 1st Qu.: 11.0 EA Sports : 145 E :2118
## Median : 27.0 Capcom : 128 M :1459
## Mean : 173.4 Ubisoft : 104 E10+ : 946
## 3rd Qu.: 89.0 Konami : 100 : 70
## Max. :10665.0 Ubisoft Montreal: 88 RP : 2
## (Other) :6300 (Other): 2
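The filtering step behind the summary above can be sketched roughly as follows. This is a minimal sketch on a made-up four-row data frame, not the original code. It assumes that empty strings and "tbd" in User_Score are treated as missing before incomplete rows are dropped; the report does not show how this was actually done.

```r
# Toy data mimicking a few raw columns; all values are invented for illustration.
games <- data.frame(
  Name         = c("A", "B", "C", "D"),
  Critic_Score = c(76, NA, 80, 65),
  User_Score   = c("8", "tbd", "", "7.5"),
  User_Count   = c(322, NA, NA, 40),
  stringsAsFactors = TRUE
)

# Assumption: placeholder strings count as missing user scores.
games$User_Score[games$User_Score %in% c("", "tbd")] <- NA

# Keep only the rows that have no missing value in any column.
complete <- games[complete.cases(games), ]
nrow(complete)
```

Only rows "A" and "D" survive here, because "B" and "C" each have at least one missing field.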
Let’s check the data types of the columns to make sure they are correct.
## 'data.frame': 7017 obs. of 16 variables:
## $ Name : Factor w/ 11563 levels "","'98 Koshien",..: 11059 5573 11061 6693 11057 6696 5572 11051 4966 11052 ...
## $ Platform : Factor w/ 31 levels "2600","3DO","3DS",..: 26 26 26 5 26 26 5 26 29 26 ...
## $ Year_of_Release: Factor w/ 40 levels "1980","1981",..: 27 29 30 27 27 30 26 28 31 30 ...
## $ Genre : Factor w/ 13 levels "","Action","Adventure",..: 12 8 12 6 5 6 8 12 5 12 ...
## $ Publisher : Factor w/ 582 levels "10TACLE Studios",..: 371 371 371 371 371 371 371 371 330 371 ...
## $ NA_Sales : num 41.4 15.7 15.6 11.3 14 ...
## $ EU_Sales : num 28.96 12.76 10.93 9.14 9.18 ...
## $ JP_Sales : num 3.77 3.79 3.28 6.5 2.93 4.7 4.13 3.6 0.24 2.53 ...
## $ Other_Sales : num 8.45 3.29 2.95 2.88 2.84 2.24 1.9 2.15 1.69 1.77 ...
## $ Global_Sales : num 82.5 35.5 32.8 29.8 28.9 ...
## $ Critic_Score : int 76 82 80 89 58 87 91 80 61 80 ...
## $ Critic_Count : int 51 73 73 65 41 80 64 63 45 33 ...
## $ User_Score : Factor w/ 97 levels "","0","0.2","0.3",..: 79 82 79 84 65 83 85 76 62 73 ...
## $ User_Count : int 322 709 192 431 129 594 464 146 106 52 ...
## $ Developer : Factor w/ 1697 levels "","10tacle Studios",..: 1035 1035 1035 1035 1035 1035 1035 1035 621 1035 ...
## $ Rating : Factor w/ 9 levels "","AO","E","E10+",..: 3 3 3 3 3 3 3 3 3 3 ...
User Score and Year of Release are factors and should be converted to numeric.
## [1] "1980" "1981" "1982" "1983" "1984" "1985" "1986" "1987" "1988" "1989"
## [11] "1990" "1991" "1992" "1993" "1994" "1995" "1996" "1997" "1998" "1999"
## [21] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [31] "2010" "2011" "2012" "2013" "2014" "2015" "2016" "2017" "2020" "N/A"
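There are also some missing values in the Year of Release (the “N/A” level above). Let’s remove those rows and then convert User Score and Year of Release to a numeric data type. Note that the converted User_Score values in the output below (79, 82, 79, 84, ...) are identical to the factor level codes shown in the previous structure listing, so they are level codes rather than the original 0 to 10 scores. The codes increase with the score and land on a scale of roughly 0 to 100, but they are not an exact rescaling, which is worth keeping in mind wherever User Score is used later.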
## 'data.frame': 6894 obs. of 16 variables:
## $ Name : Factor w/ 11563 levels "","'98 Koshien",..: 11059 5573 11061 6693 11057 6696 5572 11051 4966 11052 ...
## $ Platform : Factor w/ 31 levels "2600","3DO","3DS",..: 26 26 26 5 26 26 5 26 29 26 ...
## $ Year_of_Release: num 2006 2008 2009 2006 2006 ...
## $ Genre : Factor w/ 13 levels "","Action","Adventure",..: 12 8 12 6 5 6 8 12 5 12 ...
## $ Publisher : Factor w/ 582 levels "10TACLE Studios",..: 371 371 371 371 371 371 371 371 330 371 ...
## $ NA_Sales : num 41.4 15.7 15.6 11.3 14 ...
## $ EU_Sales : num 28.96 12.76 10.93 9.14 9.18 ...
## $ JP_Sales : num 3.77 3.79 3.28 6.5 2.93 4.7 4.13 3.6 0.24 2.53 ...
## $ Other_Sales : num 8.45 3.29 2.95 2.88 2.84 2.24 1.9 2.15 1.69 1.77 ...
## $ Global_Sales : num 82.5 35.5 32.8 29.8 28.9 ...
## $ Critic_Score : int 76 82 80 89 58 87 91 80 61 80 ...
## $ Critic_Count : int 51 73 73 65 41 80 64 63 45 33 ...
## $ User_Score : num 79 82 79 84 65 83 85 76 62 73 ...
## $ User_Count : int 322 709 192 431 129 594 464 146 106 52 ...
## $ Developer : Factor w/ 1697 levels "","10tacle Studios",..: 1035 1035 1035 1035 1035 1035 1035 1035 621 1035 ...
## $ Rating : Factor w/ 9 levels "","AO","E","E10+",..: 3 3 3 3 3 3 3 3 3 3 ...
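Converting a factor to numbers in R has a well-known pitfall, so here is a small sketch of the conversion step. The vector `user_score` is hypothetical; the point is only that `as.numeric()` on a factor returns the internal level codes, while going through `as.character()` first recovers the values.

```r
# Hypothetical factor of scores stored as text, as it might arrive from a CSV.
user_score <- factor(c("8", "8.3", "6.6", "8"))

# Pitfall: as.numeric() on a factor returns the internal level codes.
codes <- as.numeric(user_score)

# Converting to character first recovers the actual scores.
values <- as.numeric(as.character(user_score))

# Optional: rescale to 0-100 so the scores are comparable with Critic_Score.
values_100 <- values * 10

codes
values
```

Here `codes` is 2, 3, 1, 2 (positions in the sorted levels), while `values` is 8, 8.3, 6.6, 8.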
Let’s look at the distributions of some of the variables.
Distribution of Global Sales
The Global Sales distribution has a long right tail, but on a logarithmic scale it looks approximately normal.
Let’s look at the sales distribution by region.
Global Sales by Region
Looking at the distribution of sales by region, the dataset seems to consist of games that sold mostly in the North American market. This makes sense, since the subset we are looking at includes only games rated on Metacritic.com, a website with a primarily American audience.
Another observation is that many games have sales close to 0 in the markets outside North America, which shows up as a high vertical bar on the left of the histograms.
Distribution by Year of Release
Most of the games in the dataset were released between 2000 and 2015. However, games with the highest median global sales were released before 2000.
Distribution by Genre
The most represented genre in the dataset is Action, followed by Sports and Shooter. In terms of median sales, the most popular genres are Sports and Miscellaneous, followed by Platform, Shooter and Fighting.
Distribution by Platform
Sony consoles (PS, PS3 and PS2) lead in terms of median sales per game. There is no clear relationship between the number of games produced for a platform and the median sales on that platform. For example, the newest consoles from Nintendo (WiiU) and Microsoft (XOne) do not have many games released yet, but their median sales per game are quite high. PC games, in contrast, are abundant but generate very few sales (one possible explanation is that PC games are more prone to piracy).
Let’s add a new variable called “Bestseller” for games that sold a million or more copies. Let’s look at the top Publishers and Developers in terms of total sales and see how many bestseller games they have in their portfolio.
Top 5 game publishers in terms of total Global Sales
## # A tibble: 5 x 6
## Publisher total_sales median_sales n bestsellers best_share
## <fct> <dbl> <dbl> <int> <dbl> <dbl>
## 1 Electronic Arts 869. 0.53 945 273 28.9
## 2 Nintendo 850. 1.03 293 149 50.8
## 3 Activision 536. 0.45 492 125 25.4
## 4 Sony Computer Ent~ 388. 0.46 316 92 29.1
## 5 Take-Two Interact~ 350. 0.44 273 77 28.2
The biggest game publishers are Electronic Arts and Nintendo, each with more than 800 million copies sold. While Electronic Arts stands out for the number of games published (945), Nintendo has published far fewer (293) but has a median of about twice as many copies sold per title. The same holds for bestsellers: each of the other top 5 publishers has a bestseller ratio of 25 to 29%, whereas 51% of Nintendo’s portfolio consists of bestsellers.
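The columns of the publisher table can be reproduced with a grouped summary. The sketch below uses base R on a made-up five-row data frame; the tibble output suggests the original used a different toolkit, so treat this as one possible way to compute the same columns. It assumes Global_Sales is in millions of copies, so a bestseller is a game with `Global_Sales >= 1`.

```r
# Toy data; publishers and sales figures are invented for illustration.
sales <- data.frame(
  Publisher    = c("P1", "P1", "P1", "P2", "P2"),
  Global_Sales = c(0.5, 1.2, 3.0, 0.2, 1.0)
)

# Binary flag: 1 if the game sold a million copies or more.
sales$Bestseller <- as.integer(sales$Global_Sales >= 1)

# One summary row per publisher.
by_pub <- do.call(rbind, lapply(split(sales, sales$Publisher), function(d) {
  data.frame(
    Publisher    = d$Publisher[1],
    total_sales  = sum(d$Global_Sales),
    median_sales = median(d$Global_Sales),
    n            = nrow(d),
    bestsellers  = sum(d$Bestseller),
    best_share   = round(100 * mean(d$Bestseller), 1)
  )
}))

# Sort by total sales, largest first.
by_pub <- by_pub[order(-by_pub$total_sales), ]
by_pub
```

The same pattern works for the developer table by splitting on Developer instead of Publisher.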
Top 5 game developers in terms of total Global Sales
## # A tibble: 5 x 6
## Developer total_sales median_sales n bestsellers best_share
## <fct> <dbl> <dbl> <int> <dbl> <dbl>
## 1 Nintendo 530. 3.23 68 52 76.5
## 2 EA Sports 146. 0.6 142 46 32.4
## 3 EA Canada 131. 0.48 149 41 27.5
## 4 Rockstar North 119. 7.99 14 11 78.6
## 5 Capcom 115. 0.36 126 34 27.0
The top developers are Nintendo and Electronic Arts (EA Sports and EA Canada are both divisions of Electronic Arts). What stands out is that Nintendo is even more successful as a developer than as a publisher, with a median of 3 million sales per title and 76% of bestsellers in its portfolio. Another developer that stands out for its bestseller rate is Rockstar North, the developer of the Grand Theft Auto franchise. It developed only 14 games, but 11 of them became bestsellers (79%), giving the company a median of 8 million copies sold per title.
Distribution of Critic Scores and User Scores
Let’s analyse the distribution of Critic Scores and User Scores.
Critic Score Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 62.00 72.00 70.26 80.00 98.00
User Score Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 64.00 74.00 70.84 81.00 95.00
Both the Critic Score and User Score distributions are skewed left. Curiously, while User Scores are slightly more positive on average, Critics tend to give more extremely positive scores and Users tend to give more extremely negative ones (the long tail of the distribution).
Initially the dataset consisted of 16719 observations and 16 variables.
However, because of missing values, part of the observations were removed, leaving 6894 observations with complete data.
The data types of several columns were also adjusted so that further analysis can run smoothly.
The following additional variables were created:
The data can be used to predict either the amount of game sales or whether a specific game will become a bestseller or not. Depending on the problem formulation, the target variable can be either the amount of copies sold (Global_Sales), or Bestseller (in this case the target would be binary, a game is either a Bestseller or not).
Global Sales has a long tail distribution, which is why graphs that include sales will be represented on a logarithmic scale.
The potential predictor variables are:
The dataset covers games released mostly between 2000 and 2015. The data itself was last updated in December 2016.
In this section we will analyse in more detail the possible relationships among the variables explored in the first section.
As we saw in the previous section, some genres have higher global sales than others. We can also see that some genres show more variance (for example Simulation), whereas others have a narrower spread (Adventure).
## [[1]]
## NULL
##
## $Action
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1200 0.2900 0.7334 0.7300 21.0400
##
## $Adventure
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.0600 0.1300 0.3088 0.2900 5.5400
##
## $Fighting
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1300 0.3300 0.6597 0.7700 12.8400
##
## $Misc
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1700 0.3800 1.0806 0.9875 28.9200
##
## $Platform
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1200 0.3500 0.9375 0.9450 29.8000
##
## $Puzzle
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.0800 0.1400 0.6686 0.5600 15.2900
##
## $Racing
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1100 0.2700 0.8167 0.7700 35.5200
##
## $`Role-Playing`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1000 0.2600 0.7023 0.7000 9.7200
##
## $Shooter
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1100 0.3400 0.9412 0.9025 14.7300
##
## $Simulation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.0800 0.3000 0.6763 0.7175 12.1300
##
## $Sports
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1600 0.3800 0.8787 0.8550 82.5300
##
## $Strategy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.0400 0.0900 0.2521 0.2775 4.8400
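The `[[1]]` / `NULL` entry at the top of the per-genre output is most likely produced by the empty genre level `""` that the structure listing shows (13 levels for 12 actual genres). The sketch below reproduces the effect on made-up data and shows that `droplevels()` removes it; whether the original summaries were produced with `tapply()` is an assumption.

```r
# Toy factor with an unused empty level, similar to the Genre column.
genre <- factor(c("Action", "Action", "Sports"),
                levels = c("", "Action", "Sports"))
global_sales <- c(0.3, 1.1, 0.4)

# The unused level yields a NULL entry in the result.
with_empty <- tapply(global_sales, genre, summary)

# Dropping unused levels first gives one entry per genre that actually occurs.
without_empty <- tapply(global_sales, droplevels(genre), summary)

length(with_empty)
length(without_empty)
```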
Let’s see whether genre preferences stay the same if we split the sales data by market.
Whereas the North American and European markets are somewhat similar in terms of best-selling genres, the Japanese market seems to show different tendencies. Its best-selling genres are Role-Playing and Puzzle, whereas the worst-selling ones are Racing, Sports, Strategy and Shooter.
Let’s look at the top selling titles per region.
Top 5 most sold titles in North America
## Name Genre NA_Sales Global_Sales
## 1 Wii Sports Sports 41.36 82.53
## 2 Mario Kart Wii Racing 15.68 35.52
## 3 Wii Sports Resort Sports 15.61 32.77
## 4 Kinect Adventures! Misc 15.00 21.81
## 5 New Super Mario Bros. Wii Platform 14.44 28.32
Top 5 most sold titles in Europe
## Name Genre EU_Sales Global_Sales
## 1 Wii Sports Sports 28.96 82.53
## 2 Mario Kart Wii Racing 12.76 35.52
## 3 Wii Sports Resort Sports 10.93 32.77
## 4 Brain Age: Train Your Brain in Minutes a Day Misc 9.20 20.15
## 5 Wii Play Misc 9.18 28.92
Top 5 most sold titles in Japan
## Name Genre JP_Sales Global_Sales
## 1 New Super Mario Bros. Platform 6.50 29.80
## 2 Animal Crossing: Wild World Simulation 5.33 12.13
## 3 Brain Age 2: More Training in Minutes a Day Puzzle 5.32 15.29
## 4 New Super Mario Bros. Wii Platform 4.70 28.32
## 5 Animal Crossing: New Leaf Simulation 4.39 9.16
While the top titles in North America and Europe are almost the same, Japan has a very different list. The genres of the top titles also vary significantly: in North America and Europe, Sports and Racing top the list, whereas in Japan it is the Platform and Simulation genres.
What about the reception by critics and users for each genre?
It looks like the genres preferred by Critics and Users are not necessarily the best-selling ones. For instance, the Puzzle genre gets comparatively high critic scores but does not sell well. Strategy receives one of the best median scores from users but is one of the worst in terms of sales (one possible explanation is that Strategy games are more common on PC and, as we saw earlier, PC games are among the worst-selling ones).
Proportion of bestsellers per genre
## # A tibble: 12 x 6
## Genre total_sales median_sales n bestsellers best_share
## <fct> <dbl> <dbl> <int> <dbl> <dbl>
## 1 Misc 417. 0.38 386 95 24.6
## 2 Shooter 817. 0.34 868 203 23.4
## 3 Platform 378. 0.35 403 94 23.3
## 4 Sports 836. 0.38 951 195 20.5
## 5 Racing 479. 0.27 586 117 20.0
## 6 Fighting 250. 0.33 379 75 19.8
## 7 Action 1206. 0.290 1644 309 18.8
## 8 Puzzle 78.9 0.14 118 21 17.8
## 9 Simulation 204. 0.3 302 53 17.6
## 10 Role-Playing 502. 0.26 715 123 17.2
## 11 Adventure 81.5 0.13 264 16 6.06
## 12 Strategy 70.1 0.09 278 15 5.4
The genres with the highest proportion of bestsellers are mostly in line with the previous findings about the best-selling genres, the top ones being Miscellaneous, Shooter and Platform.
There seems to be a positive correlation between Critic Score and Global Sales. The relationship is not exactly linear; there seems to be a slight exponential curve.
The relationship between User Score and Global Sales is also slightly positive, although much less pronounced than the relationship between Critic Score and Global Sales.
Let’s calculate Pearson correlation coefficient for Critic Score, User Score and Global Sales.
## Global_Sales Critic_Score User_Score
## Global_Sales 1.00 0.24 0.09
## Critic_Score 0.24 1.00 0.58
## User_Score 0.09 0.58 1.00
As concluded earlier from the scatterplots, Critic Score has higher correlation (0.24) with Global Sales than User Score (0.09), making it a more useful metric to add to the prediction model.
What is more, Critic Score and User Score seem to be positively correlated with one another (0.58), so User Score should probably be removed from the model to avoid multicollinearity.
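A correlation matrix like the one above can be computed with `cor()`. The sketch below uses six invented rows, so its numbers are not comparable with the reported 0.24, 0.09 and 0.58; it only illustrates the call and the shape of the result.

```r
# Toy data; the values are invented and do not come from the dataset.
scores <- data.frame(
  Global_Sales = c(0.1, 0.4, 0.3, 1.2, 2.5, 0.8),
  Critic_Score = c(55, 62, 70, 78, 90, 84),
  User_Score   = c(60, 58, 75, 70, 85, 80)
)

# Pearson correlation for every pair of columns, rounded to two decimals.
corr <- round(cor(scores, method = "pearson"), 2)
corr
```

The result is a symmetric matrix with ones on the diagonal, so only the values below (or above) the diagonal need to be read.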
Let’s look at the distribution of Critic Scores depending on whether the game is a bestseller or not.
## $`0`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 60.00 70.00 68.08 78.00 98.00
##
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.00 74.00 81.00 79.49 87.00 98.00
Bestselling games tend to receive higher Critic Scores (an average score of 80 compared to an average of 68 for non-bestsellers). The distribution of Critic Scores is also narrower for bestsellers (there is more unanimity among Critics when it comes to bestselling games).
Some genres seem to sell better than others. However, it is important to take into account that genre preferences may vary by region. This is especially true for the Japanese market, which clearly has different genre preferences from the North American or European markets.
The feature with the strongest correlation with the target variable seems to be the critic score. This is true both when we take Global Sales as the target variable (slight positive correlation, non-linear relationship) and when we take Bestseller as the target variable (games that are bestsellers tend to have higher critic scores).
User score, on the other hand, has a weaker correlation with Global Sales, even though this might seem counterintuitive, since it is the end users who buy the games after all.
Critic score and user score are positively correlated with each other. This relationship has to be taken into account when building a predictive model, since it is a potential cause of multicollinearity.
Now that we know there is a certain correlation between critic score and sales, and that there are some regional preferences for genres, let’s look at some other factors that might play a role in creating a bestseller game.
We can suppose that companies that create games gradually become better at it, so the more games they release, the higher the sales per game.
Another possible assumption is that once a game has become a bestseller, reusing the same franchise can be a factor for success, since users are already familiar with the brand and are more likely to buy the game if they liked the previous one in the series.
The Mario franchise was definitely a big success, with most of the released games achieving above-average global sales. What is curious, though, is that the original series and genres, the Super Mario series (Platform) and the Mario Kart series (Racing), proved much more popular than the subsequent attempts to bring the franchise into other genres, such as Sports, Puzzle or Role-Playing.
The Final Fantasy saga is a good example of the fact that a brand name and high-quality past games are not the sole recipe for success. While the first games of the saga in the dataset (Final Fantasy VII and Final Fantasy VIII) were huge hits, the subsequent trend is decreasing, with most of the games from 2005 on selling below average.
In the evolution of the Final Fantasy saga, we can see that certain platforms appear and disappear over time. Let’s have a closer look at this relationship.
From the above graph we can clearly see the cycles of console generations, where the newest models replace the oldest ones, and the latest games are therefore produced for the newest console models. The platforms on the rise as of 2016 are XOne from Microsoft and PS4 from Sony.
Let’s fit a logistic regression model to predict whether a specific game will become a bestseller or not.
Let’s start by splitting the data into the training and testing sets and fitting a logistic regression with the following predictor variables:
##
## Call:
## glm(formula = Bestseller ~ Critic_Score + User_Score + Genre +
## Platform + Year, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9299 -0.6170 -0.3513 -0.1391 3.6169
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -16.268892 324.744067 -0.050 0.960045
## Critic_Score 0.110559 0.005051 21.889 < 2e-16 ***
## User_Score -0.017174 0.004180 -4.109 3.98e-05 ***
## GenreAdventure -1.182956 0.336491 -3.516 0.000439 ***
## GenreFighting -0.282044 0.186075 -1.516 0.129582
## GenreMisc 0.170572 0.174509 0.977 0.328350
## GenrePlatform 0.117624 0.179483 0.655 0.512243
## GenrePuzzle -0.198313 0.306332 -0.647 0.517388
## GenreRacing -0.116706 0.160404 -0.728 0.466874
## GenreRole-Playing -0.419506 0.153122 -2.740 0.006150 **
## GenreShooter 0.137560 0.133606 1.030 0.303201
## GenreSimulation 0.130330 0.213172 0.611 0.540946
## GenreSports -0.576819 0.134724 -4.281 1.86e-05 ***
## GenreStrategy -1.310751 0.314492 -4.168 3.08e-05 ***
## PlatformDC -1.992679 0.945844 -2.107 0.035137 *
## PlatformDS 0.005264 0.355826 0.015 0.988196
## PlatformGBA -0.893862 0.422054 -2.118 0.034185 *
## PlatformGC -1.163700 0.413940 -2.811 0.004935 **
## PlatformPC -1.971529 0.366671 -5.377 7.58e-08 ***
## PlatformPS -0.798575 0.533424 -1.497 0.134374
## PlatformPS2 -0.147572 0.365273 -0.404 0.686210
## PlatformPS3 0.286110 0.321560 0.890 0.373596
## PlatformPS4 1.296125 0.415212 3.122 0.001799 **
## PlatformPSP -0.525334 0.373050 -1.408 0.159068
## PlatformPSV -1.201352 0.564828 -2.127 0.033425 *
## PlatformWii 0.567358 0.345941 1.640 0.100997
## PlatformWiiU 0.067380 0.452797 0.149 0.881705
## PlatformX360 0.248401 0.324132 0.766 0.443464
## PlatformXB -1.718751 0.403356 -4.261 2.03e-05 ***
## PlatformXOne 0.974060 0.427663 2.278 0.022748 *
## Year1992 -2.469574 459.257006 -0.005 0.995710
## Year1996 9.807653 324.745201 0.030 0.975907
## Year1997 9.538832 324.744603 0.029 0.976567
## Year1998 9.265539 324.744453 0.029 0.977238
## Year1999 9.141267 324.744345 0.028 0.977543
## Year2000 8.119093 324.744059 0.025 0.980054
## Year2001 8.891254 324.743882 0.027 0.978157
## Year2002 8.666601 324.743869 0.027 0.978709
## Year2003 8.626824 324.743865 0.027 0.978807
## Year2004 8.923528 324.743860 0.027 0.978078
## Year2005 8.082170 324.743861 0.025 0.980144
## Year2006 8.009272 324.743851 0.025 0.980323
## Year2007 8.337090 324.743835 0.026 0.979518
## Year2008 8.463030 324.743837 0.026 0.979209
## Year2009 7.732002 324.743853 0.024 0.981005
## Year2010 8.331334 324.743847 0.026 0.979532
## Year2011 8.011568 324.743852 0.025 0.980318
## Year2012 8.089395 324.743862 0.025 0.980127
## Year2013 8.071318 324.743877 0.025 0.980171
## Year2014 7.607094 324.743939 0.023 0.981311
## Year2015 7.130425 324.743991 0.022 0.982482
## Year2016 6.187811 324.744040 0.019 0.984798
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5287.3 on 5514 degrees of freedom
## Residual deviance: 4097.1 on 5463 degrees of freedom
## AIC: 4201.1
##
## Number of Fisher Scoring iterations: 11
We see that there are a number of statistically significant predictor variables, such as Critic Score, User Score as well as some of the genres and platforms.
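The split-and-fit workflow can be sketched as below. The degrees of freedom and the confusion matrix in this report imply roughly an 80/20 split (5,515 training rows and 1,379 test rows), but the exact splitting code is not shown, so everything here is an assumption: the data are simulated, only Critic_Score is used as a predictor, the generating coefficients (-8.5 and 0.11) are picked loosely for illustration, and 0.5 is assumed as the classification threshold.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Simulated data: one predictor and a binary outcome.
n <- 500
sim <- data.frame(Critic_Score = round(runif(n, 40, 95)))
sim$Bestseller <- rbinom(n, 1, plogis(-8.5 + 0.11 * sim$Critic_Score))

# Random 80/20 split into training and testing sets.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Logistic regression fitted on the training set only.
fit <- glm(Bestseller ~ Critic_Score, family = "binomial", data = train)

# Predicted probabilities on the test set, cut at 0.5 to get class labels.
prob <- predict(fit, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)
accuracy <- mean(pred == test$Bestseller)
accuracy
```

Evaluating on rows that were not used for fitting is what makes the accuracy an estimate of out-of-sample performance.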
Now let’s check our model for multicollinearity using VIF to make sure we don’t have variables that are highly correlated with each other.
## GVIF Df GVIF^(1/(2*Df))
## Critic_Score 1.527401 1 1.235881
## User_Score 1.596800 1 1.263645
## Genre 1.583116 11 1.021101
## Platform 115.312168 16 1.159935
## Year 98.486223 22 1.109951
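Platform and Year of Release have raw GVIF values far above 10 (115 and 98). Both are factor terms with many degrees of freedom (16 and 22), for which the scaled value GVIF^(1/(2*Df)) in the last column is the more comparable measure, and that value stays close to 1 (1.16 and 1.11). So the VIF evidence on its own is weaker than the raw numbers suggest. Still, earlier we saw that the year of release and the platform are related, since the newest games tend to be released on the latest generation of consoles, and the Year coefficients in the model above have very large standard errors (about 325) and are not significant. We will therefore drop Year of Release to remove one of the two mutually related variables.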
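The last column of the VIF table can be recomputed from the first two, which helps when reading GVIF values for factor terms. The sketch below uses only the numbers printed above. The remark that the squared scaled value is comparable to an ordinary VIF is the usual reading of this statistic and is stated here as an assumption rather than something the report itself claims.

```r
# GVIF and degrees of freedom copied from the table above.
gvif <- c(Critic_Score = 1.527401, User_Score = 1.596800, Genre = 1.583116,
          Platform = 115.312168, Year = 98.486223)
df   <- c(Critic_Score = 1, User_Score = 1, Genre = 11,
          Platform = 16, Year = 22)

# Scaled GVIF, which adjusts for the number of coefficients in each term.
scaled <- gvif^(1 / (2 * df))
round(scaled, 6)

# Squaring the scaled value puts it on a scale comparable to a plain VIF.
round(scaled^2, 2)
```

For Platform and Year the scaled values come out at about 1.16 and 1.11, matching the table.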
##
## Call:
## glm(formula = Bestseller ~ Critic_Score + User_Score + Genre +
## Platform, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8533 -0.6301 -0.3637 -0.1456 3.7026
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.519695 0.452287 -18.837 < 2e-16 ***
## Critic_Score 0.109401 0.004932 22.180 < 2e-16 ***
## User_Score -0.015359 0.004009 -3.832 0.000127 ***
## GenreAdventure -1.172063 0.333626 -3.513 0.000443 ***
## GenreFighting -0.270969 0.184273 -1.470 0.141434
## GenreMisc 0.225718 0.172627 1.308 0.191027
## GenrePlatform 0.157199 0.176915 0.889 0.374243
## GenrePuzzle -0.136924 0.303074 -0.452 0.651423
## GenreRacing -0.068704 0.156666 -0.439 0.660996
## GenreRole-Playing -0.388357 0.150842 -2.575 0.010036 *
## GenreShooter 0.171692 0.131268 1.308 0.190889
## GenreSimulation 0.229868 0.207391 1.108 0.267700
## GenreSports -0.523713 0.131739 -3.975 7.03e-05 ***
## GenreStrategy -1.238508 0.312723 -3.960 7.48e-05 ***
## PlatformDC -1.204736 0.860489 -1.400 0.161495
## PlatformDS 0.389626 0.326289 1.194 0.232435
## PlatformGBA -0.042424 0.357210 -0.119 0.905461
## PlatformGC -0.403175 0.350678 -1.150 0.250267
## PlatformPC -1.629385 0.344163 -4.734 2.20e-06 ***
## PlatformPS 0.291387 0.368431 0.791 0.429010
## PlatformPS2 0.566402 0.303879 1.864 0.062335 .
## PlatformPS3 0.582168 0.307008 1.896 0.057926 .
## PlatformPS4 0.477694 0.340350 1.404 0.160456
## PlatformPSP -0.194509 0.343374 -0.566 0.571079
## PlatformPSV -1.222290 0.555320 -2.201 0.027732 *
## PlatformWii 0.902364 0.320502 2.815 0.004871 **
## PlatformWiiU -0.042814 0.436672 -0.098 0.921896
## PlatformX360 0.560877 0.306871 1.828 0.067590 .
## PlatformXB -0.975440 0.339820 -2.870 0.004099 **
## PlatformXOne 0.203193 0.363202 0.559 0.575856
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5287.3 on 5514 degrees of freedom
## Residual deviance: 4177.9 on 5485 degrees of freedom
## AIC: 4237.9
##
## Number of Fisher Scoring iterations: 6
From the model coefficients, we can see that each one-unit increase in critic score multiplies the odds of a game being a bestseller by about 1.12, holding the other predictors fixed (1.12 is the exponential of the critic score coefficient, 0.109401).
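The conversion from a logistic regression coefficient to an odds ratio is a one-liner. The sketch below uses the Critic_Score coefficient from the summary above; the 10-point example is added only for illustration.

```r
# Critic_Score coefficient taken from the model summary above.
coef_critic <- 0.109401

# Odds ratio for a one-point increase in critic score.
odds_ratio_1 <- exp(coef_critic)

# Odds ratio for a ten-point increase: the log-odds effect adds up,
# so the odds ratios multiply.
odds_ratio_10 <- exp(10 * coef_critic)

round(c(one_point = odds_ratio_1, ten_points = odds_ratio_10), 2)
```

An odds ratio describes a change in odds, not in probability, so the effect on the predicted probability depends on where the game starts.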
## [1] "Accuracy 0.820159535895576"
Our model correctly classified 82% of the testing set. Let’s build a confusion matrix to see in more detail which kinds of errors the model is prone to.
## Confusion Matrix and Statistics
##
##
## fitted.results 0 1
## 0 1048 211
## 1 37 83
##
## Accuracy : 0.8202
## 95% CI : (0.7989, 0.8401)
## No Information Rate : 0.7868
## P-Value [Acc > NIR] : 0.00116
##
## Kappa : 0.3165
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.28231
## Specificity : 0.96590
## Pos Pred Value : 0.69167
## Neg Pred Value : 0.83241
## Prevalence : 0.21320
## Detection Rate : 0.06019
## Detection Prevalence : 0.08702
## Balanced Accuracy : 0.62411
##
## 'Positive' Class : 1
##
The confusion matrix reveals a much higher share of false negatives (15.3% of the test set: 211 games that are bestsellers but were classified as non-bestsellers) than false positives (2.7% of the test set: 37 games that are not bestsellers but were classified as bestsellers). Since we want to be conservative about our predictions, this scenario is better than having a high proportion of false positives.
## Precision: 0.6916667
## Recall: 0.2823129
The high proportion of false negatives results in a low recall (also known as sensitivity) of 28.2%, meaning that out of all games that are bestsellers, only 28.2% were correctly classified as such. The rest were incorrectly classified as non-bestsellers.
Specificity is quite high (96.6%), meaning that most of the non-bestsellers are correctly classified as such.
The precision of the model is 69.2%, meaning that out of all the cases classified as bestsellers, 69.2% are actually bestsellers.
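All of the metrics discussed above follow directly from the four cells of the confusion matrix. The sketch below recomputes them from the printed counts, with rows as predictions and columns as actual classes, and with 1 (bestseller) as the positive class.

```r
# Counts copied from the confusion matrix above (filled column by column).
cm <- matrix(c(1048, 37, 211, 83), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

tp <- cm["1", "1"]  # bestsellers predicted as bestsellers
tn <- cm["0", "0"]  # non-bestsellers predicted as non-bestsellers
fp <- cm["1", "0"]  # non-bestsellers predicted as bestsellers
fn <- cm["0", "1"]  # bestsellers predicted as non-bestsellers

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)   # recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)   # positive predictive value

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

The false negative and false positive shares quoted in the text are `fn / sum(cm)` and `fp / sum(cm)`.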
Let’s see what predictions our model gives for the following recently released games:
## Name Platform Genre Critic_Score User_Score
## 1 Red Dead Redemption 2 PS4 Action 97 78
## 2 Spyro Reignited Trilogy XOne Platform 82 33
## 3 Fallout 76 PC Role-Playing 59 28
## Name predict bestseller
## 1 Red Dead Redemption 2 0.79768716 1
## 2 Spyro Reignited Trilogy 0.57560291 1
## 3 Fallout 76 0.01084852 0
It looks like our model predicts Red Dead Redemption 2 for PS4 and Spyro Reignited Trilogy for XOne to be bestsellers, while Fallout 76 for PC is classified as a non-bestseller. Looking at the probabilities, even though both Red Dead Redemption 2 and Spyro Reignited Trilogy are classified as bestsellers, the model is much more confident about the first one becoming a bestseller (a probability of 0.80 versus 0.58).
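These probabilities can be recomputed by hand from the coefficients of the second model, which shows how the linear predictor turns into a probability. The sketch below copies the rounded coefficients from the summary and assumes that Action is the reference genre (it has no row in the coefficient table, so its effect is 0). The variable names are illustrative only.

```r
# Rounded coefficients copied from the second model summary.
coefs         <- c(intercept = -8.519695, critic = 0.109401, user = -0.015359)
genre_coef    <- c(Action = 0, Platform = 0.157199, "Role-Playing" = -0.388357)
platform_coef <- c(PS4 = 0.477694, XOne = 0.203193, PC = -1.629385)

new_games <- data.frame(
  Name         = c("Red Dead Redemption 2", "Spyro Reignited Trilogy", "Fallout 76"),
  Platform     = c("PS4", "XOne", "PC"),
  Genre        = c("Action", "Platform", "Role-Playing"),
  Critic_Score = c(97, 82, 59),
  User_Score   = c(78, 33, 28),
  stringsAsFactors = FALSE
)

# Linear predictor (log-odds) for each game.
logit <- coefs["intercept"] +
  coefs["critic"] * new_games$Critic_Score +
  coefs["user"]   * new_games$User_Score +
  genre_coef[new_games$Genre] +
  platform_coef[new_games$Platform]

# The inverse logit turns log-odds into probabilities.
prob <- plogis(unname(logit))
round(prob, 4)
```

The results agree with the predicted probabilities in the table above to about four decimal places; small differences come from the rounding of the printed coefficients.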
We will have to wait for some months to see whether our predictions have turned out to be true.
While some franchises and developers are definitely more successful than others, relying on a brand name does not guarantee success. Some sagas started high but became less popular over time (Final Fantasy), while others were highly popular in some genres but failed to expand successfully into other genres (the Super Mario franchise).
There is a high correlation between Platform and Year of Release, which makes sense, as the newest games are primarily released for the latest console generations. Due to this correlation, Year of Release was removed from the model to avoid multicollinearity.
A logistic regression model was created to predict whether a game will be a bestseller based on its Genre, Platform, Critic Score and User Score.
The model’s prediction accuracy is 82%, with high specificity (96.6%) and low sensitivity (28.2%). The main weakness of the model is its high false negative rate, meaning that many games that are actually bestsellers are classified as non-bestsellers.
Understanding the relationship between the critic score and the global sales is important, since critic score will be one of the main predictor variables in our predictive model. The scatterplot and the fitted line show that there is a slight positive correlation, meaning that the higher the critic score the more game copies are sold.
When analyzing a game’s performance, it is important to take market preferences into account. Selling a game in the US is not the same as selling it in Japan, and this becomes even clearer when we look at the distribution of sales by genre in each region.
Some of the best-selling genres in the US (Sports and Shooter) are among the worst-selling in Japan, whereas Puzzle, a genre that is unpopular with American gamers, sells quite well in Japan compared to other genres.
Another important aspect of the games market is the power of the brand name and the developer’s know-how. Classic franchises such as Super Mario have been a huge success for the past two decades, producing dozens of titles for different platforms and expanding into various genres. However, even such hits have their highs and lows, and understanding the target audiences and their preferences for genres and platforms is important even if you are Nintendo.
The graph shows all titles from the Super Mario franchise released between 2000 and 2016. The colour of each dot indicates the genre, titles are ordered by year of release, and the vertical axis shows their global sales. We can see which titles were more successful in terms of sales, and we can also spot that, for the Super Mario franchise, Platform and Racing titles sell better than Sports or Puzzle titles.
Whether video games are a form of art or just a source of entertainment is a long-standing debate. I was interested in taking a more analytical approach to what it takes to make a great game. Is it the creators themselves, with their artistic skills and know-how? Or is it the brand name of a franchise that translates into high sales? These and many other questions about the games industry drove my analysis.
The dataset was taken from Kaggle and combines two datasets from different websites dedicated to games (vgchartz.com and Metacritic.com). This is important to keep in mind for the first stage of the analysis, where the data was cleaned to remove missing values and fix incorrect formats. The source of the data also limits the conclusions we can draw from it. Since the websites’ audience is primarily from the US, data points such as critic score and user score most probably reflect the American audience. The selection of games is affected by the same bias: local games that are popular in regions other than the US are likely to be underrepresented in the sample.
The exploratory analysis revealed some interesting insights. I was surprised to find that critic score is more strongly correlated with global sales than user score. Even if users like a game, it does not necessarily mean they are willing to pay for it (which happens a lot with Strategy games, which tend to be more popular on PC and are therefore more prone to piracy). It was also interesting to see how the analysis revealed differences between markets.
Finally, a model was built to predict whether a specific game is a bestseller based on its critic and user scores, genre and platform. While an accuracy of 82% was achieved with only minor tweaks to the model, there is a lot of room for improvement. Limitations of this model include the source of the data discussed above (the sample of games is biased towards the US market) and the number of entries with missing data.
Potential improvements include adding more variables to the model (for instance Publisher and Developer), the main obstacle being the number of different publishers and developers represented in the dataset. Another possible direction would be to group games by franchise (implicit in the game’s name) and see whether this affects the accuracy of the predictions.
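One common way around the large number of distinct publishers and developers is to keep only the most frequent levels and lump the rest into a single category. Here is a minimal sketch in Python. The helper `lump_rare`, the choice of `top_n` and the toy list of publishers are illustrative assumptions, not part of the original analysis:

```python
from collections import Counter

def lump_rare(values, top_n, other_label="Other"):
    """Keep the top_n most frequent values and replace the rest with other_label."""
    keep = {value for value, _ in Counter(values).most_common(top_n)}
    return [value if value in keep else other_label for value in values]

# Toy sample for illustration only, not taken from the dataset.
publishers = [
    "Electronic Arts", "Electronic Arts", "Electronic Arts",
    "Activision", "Activision",
    "Ubisoft", "THQ",
]

# Keep the two most frequent publishers, lump everything else together.
lumped = lump_rare(publishers, top_n=2)
print(Counter(lumped))
```

How many levels to keep is a judgment call: too few discards useful signal, too many leaves sparse categories that the model cannot estimate reliably.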
The analysis could also be replicated for other markets by scraping local websites dedicated to games. Other machine learning models (for example a random forest classifier) could be applied to the data to see whether they yield more accurate predictions.